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Abstract 

Background: The earliest recognizable stages of breast neoplasia are lesions that represent a heterogeneous 
collection of epithelial proliferations currently classified based on morphology. Their role in the development of 
breast cancer is not well understood but insight into the critical events at this early stage will improve efforts in 
breast cancer detection and prevention. These microscopic lesions are technically difficult to study so very little is 
known about their molecular alterations. 

Results: To characterize the transcriptional changes of early breast neoplasia, we sequenced 3'- end enriched 
RNAseq libraries from formalin-fixed paraffin-embedded tissue of early neoplasia samples and matched normal 
breast and carcinoma samples from 25 patients. We find that gene expression patterns within early neoplasias are 
distinct from both normal and breast cancer patterns and identify a pattern of pro-oncogenic changes, including 
elevated transcription of ERBB2, F0XA1, and GATA3 at this early stage. We validate these findings on a second 
independent gene expression profile data set generated by whole transcriptome sequencing. Measurements of 
protein expression by immunohistochemistry on an independent set of early neoplasias confirms that ER pathway 
regulators F0XA1 and GATA3, as well as ER itself, are consistently upregulated at this early stage. The early neoplasia 
samples also demonstrate coordinated changes in long non-coding RNA expression and microenvironment stromal 
gene expression patterns. 

Conclusions: This study is the first examination of global gene expression in early breast neoplasia, and the genes 
identified here represent candidate participants in the earliest molecular events in the development of breast 
cancer. 



Background 

Recent large genomic studies have identified and con- 
firmed numerous recurrent mutations and aneuploidies 
that stratify breast carcinomas across clinicopathologic 
features [1,2]. However, the events involved early in the 
evolution of normal breast tissue into invasive breast 
cancer are still poorly understood. Understanding the 
initiating driver events in breast cancer progression is a 
key goal in breast cancer research as it can lead to im- 
provements in early-stage diagnosis, treatment, and pre- 
vention strategies [3]. 
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While the molecular steps required for progression to 
invasive carcinoma are currently unclear, morphologic 
studies of breast biopsy tissue [4-6] suggest early neopla- 
sias such as flat epithelial atypia, atypical ductal hyperpla- 
sia (ADH), and possibly columnar cell lesion, represent 
direct precursors to ductal carcinoma in situ (DCIS), it- 
self a precursor for invasive ductal carcinoma (IDC) 
(Figure 1A). It has been shown that the presence of 
early neoplasias in breast biopsies increases the risk 
for breast cancer. However, predicting which patients with 
neoplasia will progress to invasive cancer remains difficult. 
Pre-invasive lesions diagnosed as ADH and DCIS are as- 
sociated with progression to invasive cancer in only a frac- 
tion of patients: 20% of ADH will be associated with IDC 
[5] and 50% of DCIS will progress to IDC [6]. This clinical 
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Figure 1 Early neoplasias in histological specimens. (A) Schematic of the morphology of early neoplasias, including columnar cell lesions 
(CCL) and atypical ductal hyperplasia (ADH), as well as later stage ductal carcinoma in situ (DOS), and invasive ductal carcinoma (IDC). Usual 
ductal hyperplasia (UDH) and non-proliferative fibrocystic change (NPFCC) present benign hyperplasias of the breast. Normal breast is shown as 
terminal ductal lobular units (TDLU). (B) Workflow diagram for lesion purification and RNA isolation. 



heterogeneity makes treatment of patients with early neo- 
plasia problematic and motivates research aimed at unco- 
vering the molecular mechanisms at play in these earliest 
stages of cancer development. 

We recently performed the first whole genome se- 
quencing of early neoplasias to examine the molecular 
changes during breast cancer evolution [7]. We found 
that early neoplasias have already acquired a significant 
number of genomic alterations: many of the early neo- 
plasias studied possess hundreds of single nucleotide 
mutations and several chromosome aneuploidies. Many 
of these alterations are observed in both the patients 
early neoplasia and associated invasive cancer in a sig- 
nificant fraction of instances (4 of 6 sequenced patients, 
4 of 14 early neoplasias). These findings indicate that a 
common ancestral clone develops mutations at a very 
early stage, before giving rise to both the early neoplasia 
and related cancer. While most of the single nucleotide 
variations were not shared between patients, gain of 
chromosome lq and activating mutations in PIK3CA 
were observed recurrently in some of the early neoplasia 
samples. A previous targeted study of cancer hotspot 
mutations also identified PIK3CA as a common mutation 
present in roughly half of early neoplasias, but not ne- 
cessarily correlated with progression to invasive carcino- 
ma [8]. These findings highlight the genetic heterogeneity 



among early neoplasias, consistent with the observed clin- 
ical heterogeneity in outcomes. Given a shared origin for 
early neoplasias and the adjacent invasive cancer, the early 
neoplasia mutations identified thus far represent molecu- 
lar events that may be important at this very early stage 
and suggest that further characterization of early neopla- 
sias represents a unique and promising tool for unco- 
vering additional alterations and elucidating the molecular 
mechanisms necessary for cancer initiation. 

Gene expression has been a particularly effective tool 
for classifying invasive breast cancer [9], perhaps be- 
cause it provides a downstream readout that reflects the 
combination of genetic and epigenetic alterations pre- 
sent in cells. To better understand the molecular events 
characterizing early neoplasia as a whole, we assessed 
the transcriptional changes present within a collection of 
early neoplasias from either patients possessing adjacent 
cancer or patients with no concurrent cancer. 

Results 

3SEQ gene expression profiling for early neoplasias, 
normal tissue, and adjacent cancer 

To evaluate gene expression in early neoplasias and 
allow for comparison with patient-matched normal and 
cancer samples, we studied archival samples of normal 
breast, early neoplasia, carcinoma in situ, and invasive 
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carcinoma from breast resection specimens (Figure 1A) 
[7,8]. The samples of carcinoma in situ, and invasive car- 
cinoma represent a spectrum of grade and molecular 
subclass (additional details are included in Table SI in 
Additional file 1). To ensure the purity of the cells 
within the samples, 2-mm diameter tissue cores were 
re-embedded in paraffin on their sides, re-sectioned, 
and stained to allow longitudinal examination of the cells 
across the depth of the tissue cores (Figure IB). Only sam- 
ples that possessed >90% of luminal cells with the appro- 
priate diagnosis were included among our samples. 

We employed 3SEQ (3 '-end sequencing for expression 
quantification) to quantify RNA from the formalin-fixed 
paraffin-embedded (FFPE) tissue core specimens. Previ- 
ously, we demonstrated the feasibility and utility of ap- 
plying this 3 '-end sequence tag counting method on 
RNA from FFPE material to obtain quantitative global 
gene expression data for both known mRNAs [10] and 
long non-coding RNAs (IncRNAs) [11] and to discover 
novel transcripts [11]. Here we collected RNA from 72 
samples (24 normal, 25 early neoplasia, 9 DCIS and 14 
IDC) from a total of 25 patients to study a collection of 
early neoplasias and enable comparisons across a panel 
of matched patient samples. For 16 patients, complete 
sets of matched normal, early neoplasia, and cancer (either 
DCIS or IDC) were present. For six of the cases, concur- 
rent cancer (DCIS or IDC) was not present. 3SEQ libraries 
were prepared from this collection of RNA samples and 
directional next-generation sequencing of these libraries 
yielded an average of 7 million uniquely mapping reads 
per sample (Table S2 in Additional file 1). Reads mapping 
to RefSeq genes (n = 22,775) and IncRNAs (n = 2,136) 
were counted for use in subsequent analysis (see Materials 
and methods). 

We generated a second dataset of gene expression pro- 
files for validation using an independent set of patient- 
matched breast neoplasia progression samples, including 
normal, early neoplasia, and IDC. Samples were obtained 
using methods identical to the original cohort. However, 
for this validation cohort we used full transcriptome 
RNAseq, not 3SEQ, with rRNA depleted non-polyA se- 
lected libraries. The new RNAseq dataset contains 42 
samples: 8 normal, 16 early neoplasia, and 18 cancer 
(IDC and DCIS) samples. The average mapped reads 
per sample was 61 million reads. 

Gene expression profiles define an early neoplasia 
transcriptional program 

With the gene expression data from the patient-matched 
early neoplasias, normal breast tissues, and breast can- 
cers, we characterized early neoplasia gene expression 
and compared the early neoplasias profiles to the normal 
and cancer samples. A three-class Prediction Analysis of 
Microarrays (PAM) classification analysis tailored to 



sequencing data (see Materials and methods) showed 
that normal, early neoplasia, and cancer samples can be 
correctly classified into their respective groups using 
only 44 genes, suggesting that these diagnoses represent 
distinct groups of samples that can be characterized 
using strong patterns of shared gene expression (un- 
paired three-class analysis; cross-validated misclassifica- 
tion rate =16.7%; Figure 2A,B; Tables S3 and S4 in 
Additional file 1; see Materials and methods for details). 
These 44 classification genes are representative of the 
global expression patterns for these samples, which show 
large expression differences common among the cancer 
samples relative to normal tissue and early neoplasia. 

We find similar results when excluding the DCIS cases 
in our analysis and only comparing the normal and early 
neoplasia samples to IDC. When classification is per- 
formed by PAMR [12] using only the 13 invasive cancer 
samples, along with 24 normal samples and 25 early neo- 
plasia samples we observe a cross-validation misclassifica- 
tion rate of 15.9% using 158 genes. Again gene expression 
of a relatively small number of genes appears sufficient to 
distinguish these entities. We used publicly available data 
from The Cancer Genome Atlas (TCGA) project to valid- 
ate our findings of our 44 classification genes. TCGA has 
made available RNAseq data from breast invasive carcin- 
oma samples and the matched normal samples. We ex- 
tracted the data for 20,502 genes in 45 breast invasive 
carcinoma samples and 16 matched normal samples, fil- 
tering and excluding 2,212 genes with insufficient num- 
bers of reads across samples. After applying SAMseq to 
detect significant genes and estimate the adjusted P- values 
(false discovery rate (FDR)), we find that in this TCGA 
data nearly all of the 44 genes are significant between the 
cancer and normal samples. Only two of the 44 significant 
genes (PMEPA1 and CCL19) from our cancer/neoplasia/ 
normal classification analysis are not significant in the 
TCGA dataset (adjusted P-values >0.05). Of the 42 sig- 
nificant genes, 33 are extremely significant (adjusted 
P- values <lE-5). These results validate 3SEQ as a quanti- 
tative method for analyzing breast cancer samples. 

When SAMseq was performed to characterize more 
completely the differentially expressed genes between the 
various groups in our 3SEQ dataset, we obtained 3,197 
genes that were differentially expressed in cancer versus 
normal (1,587 up-regulated and 1,610 down- regulated) 
and 2,597 genes showed differences between cancer and 
early neoplasia (1,276 up-regulated and 1,321 down- 
regulated; two-class paired SAMseq [13]; Table 1; Tables 
S5 and S6 in Additional file 1; see Materials and methods). 

In a comparison with our validation dataset, these 
findings are supported as we find 1,332 (42%) of the 
genes defining cancer versus early neoplasia are signifi- 
cantly differentially expressed (P-value <0.05, £-test) in 
the RNAseq data. Furthermore, we find that among the 
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Figure 2 Classification analyses show early neoplasia is distinct from normal tissue and cancer. (A) Cross-validated probabilities from the 
unpaired three-class PAM analysis predicting the diagnostic class of each sample. Classes are normal, early neoplasia, and cancer (includes DCIS 
and IDC). Samples are arranged as in Table S3 in Additional file 1. The pathologist-assigned diagnosis is listed above. The circles represent the 
predictions based on the gene expression of the 44 selected genes. The probabilities for each sample sum to 1. (B) Mean-centered, normalized 
heatmap showing 3SEQ transcript levels for each of the 44 genes selected in the three-class PAM analysis in (A). Samples are arranged as in (A). 
Red is high expression; green is low expression. (C) Cross-validated probabilities from the unpaired two-class (normal versus early neoplasia) PAM 
analysis for predicting the diagnostic class of each sample. Normal and early neoplasia samples are arranged as in Table S3 in Additional file 1. 
The pathologist-assigned diagnosis is listed above. The circles represent the predictions based on the gene expression of the 180 selected genes. 
The probabilities for each sample sum to 1. (D) Mean-centered, normalized heatmap showing 3SEQ transcript levels for each of the 180 genes 
selected in the two-class PAM analysis in (C). Samples are arranged as in (C). Red is high expression; green is low expression. 
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Table 1 RefSeq genes differentially expressed 





Genes up 


Genes 
down 


Total genes 
differentially 
expressed 


Normal versus early 
neoplasia 


565 (4.1%) 


190 (1.4%) 


755 (5.5%) 


Normal versus cancer 


1,587 (11.6%) 


1,610 (11.8%) 


3,197 (23.4%) 


Early neoplasia versus 
cancer 


1,276 (9.4%) 


1,321 (9.7%) 


2,597 (19.0%) 



Paired SAMseq analysis (n = 13,643), FDR <5%. Sixteen samples per class. 
Numbers and percentages reported are genes up- or down-regulated in early 
neoplasia relative to normal, and cancer relative to either normal tissue or 
early neoplasia, respectively. 



1,332 genes that are significant in both datasets, 1,323 
(99%) agree in the direction of change in expression 
(Table S14 in Additional file 1). 

The normal and early neoplasia samples display more 
similar global patterns of gene expression compared to 
the cancer samples, and in a three-way comparison with 
cancer, the differences between the normal and early 
neoplasia samples are obscured. To better characterize 
the differences distinguishing early neoplasia from nor- 
mal, we performed a second PAMR classification ana- 
lysis using only normal and early neoplasia samples to 
determine the genes needed to classify these samples 
into their respective groups. A minimum of 180 genes 
is required to correctly classify 85.7% of the normal 
and early neoplasia samples (unpaired two-class analysis; 
cross-validated misclassification rate = 14.3%; Figure 2C,D; 
Tables S3 and S4 in Additional file 1). When all genes dif- 
ferentially expressed between these two groups were iden- 
tified, 565 genes were up-regulated and 190 genes were 
down-regulated in the early neoplasias relative to normal 
(two-class paired SAMseq; Table 1; Table S7 in Additional 
file 1; see Materials and methods). Of these genes, 76% 
agree in the direction of expression change in the early 
neoplasia versus normal analysis comparison in the valid- 
ation data set (Table S15 in Additional file 1). These genes 
represent the set of transcriptional changes occurring re- 
currently across many patients as these lesions develop 
from normal breast to precursors of cancer. 

Early neoplasias express breast cancer genes 

Thousands of genes have been associated with breast 
cancer, but which of these are crucial for the early stages 
of cancer development remains uncertain. Those gene 
events shared by both breast cancer and early neoplasias 
represent a unique panel of candidates by which to ex- 
plore potential important events during the earliest stages 
of carcinogenesis. We first compared those genes differ- 
ing between early neoplasias and normal breast (n = 755; 
Table S7 in Additional file 1) with those genes identified 
as significantly altered between cancer and normal in our 
dataset (n = 3,197; Table S5 in Additional file 1). Over half 



of the neoplasia genes (456/755; 60%) overlapped the 
breast cancer gene list, and most of these genes showed 
expression changes in the same direction as the cancer 
genes relative to normal (310 genes up-regulated and 126 
genes down-regulated in both neoplasia and cancer rela- 
tive to normal). The list of genes up-regulated in both 
neoplasia and breast cancer includes well-known Cancer 
Gene Census genes [14] such as ERRB2, GAT A3, and 
MUC1, among others (Table S7 in Additional file 1). In 
fact, genes differentially altered in early neoplasias were 
enriched for breast cancer genes, including those genes 
comprising the intrinsic (21 of 306 genes; P = 3.3 x 10" 3 ; 
hypergeometric test) [15] and PAM50 (9 of 50 genes; 
P = 7.2x 10" 5 ; hypergeometric test) [16] gene signatures 
for breast cancer subtypes as well as genes included in 
the Genes-to-Systems Breast Cancer Database (151 of 
2,278; P = 9.8 x 10~ 13 ; hypergeometric test) (Table S7 in 
Additional file 1) [17]. 

Observing cancer gene alterations within early neo- 
plasias indicates that these genes may be important for 
establishing some of the earliest changes necessary to 
transform normal cells into pre-cancerous cells. Master 
regulators active at this early stage are of particular in- 
terest because of their power to affect entire transcrip- 
tional programs. Elevated levels of the proto-oncogene 
ERRB2, which encodes the HER2/neu protein, are often 
found in the context of amplified gene copy number in 
HER2+ breast cancers and are important for activating 
a number of signal transduction pathways and driving 
cancer progression [18]. While only three of the breast 
cancers in this study are HER2+ (as classified by clinical 
pathology immunohistochemistry (IHC) stain; Table SI 
in Additional file 1), ERRB2 transcript levels were ele- 
vated in several of the early neoplasia samples relative to 
normal and suggests a possible function for HER2 at this 
early stage. 

Early neoplasias are most often associated with estro- 
gen receptor-positive (ER+), luminal- type breast cancers, 
so events present in the early neoplasias may highlight 
genes important within the luminal cancer development 
pathway. Notably, altered gene expression was observed 
in early neoplasias for three of the 57 genes identified as 
commonly mutated in breast cancer by TCGA project 
[1]. These genes, FOXA1, GATA3, and MYB, all showed 
greater mutation rates in ER+ /luminal cancers relative 
to other breast cancer subtypes and, here, possessed up- 
regulated expression in early neoplasias relative to normal. 
These genes, therefore, may play a specific and important 
role in the early stages of ER+ /luminal breast cancer de- 
velopment. PIK3CA, on the other hand, is commonly mu- 
tated in both early neoplasias (approximately 50%) [8] and 
ER+/luminal breast cancers (approximately 20%) [1] but 
does not show significant expression changes in the early 
neoplasias. Additionally, when early neoplasias with a 



B runner et a I. Genome Biology 2014, 15:R71 
http://genomebiology.eom/2014/1 5/5/R71 



Page 6 of 16 



PIK3CA mutation were compared to early neoplasias 
without an identifiable PIK3CA mutation (Table SI in 
Additional file 1), no significant gene expression differ- 
ences were identified between the two groups (data not 
shown). This suggests that while PIK3CA activating mu- 
tations may be important for generating early neoplasia, 
they may not be a prominent player in promoting the 
progression to cancer at this early stage, despite known 
PIK3CA -associated expression differences at the carcin- 
oma stage [19]. 

Gain of a copy of chromosome lq is also a common 
event observed in breast cancers and this gain was previ- 
ously identified in 4 of 14 sequenced early neoplasias [7]. 
Fluorescence in situ hybridization (FISH) was performed 
on 10 early neoplasias in this study for which tissue was 
available. Two of ten early neoplasias showed a copy num- 
ber gain of chromosome lq when compared with chro- 
mosomes 2q and 3q (Table S8 in Additional file 1), yet 
amplified samples (n = 2) and wild-type samples (n = 8) 
were indistinguishable using gene expression classification 
with PAM (data not shown). Interestingly, 98 genes from 
across the genome were down- regulated in the samples 
with the lq gain compared to the wild- type samples 
(Table S9 in Additional file 1), suggesting this recurrent 
genetic event may cause global expression differences 
within the early neoplasias. 

Gene ontology analysis identified several pathways that 
do not involve known breast oncogenes that are enriched 
in early neoplasia versus normal (Table S12 in Additional 
file 1). These include pathways that influence membrane 
transport, including endocytosis ABC transporters. 

Elevated estrogen receptor pathway genes in most early 
neoplasias 

Despite genetic heterogeneity of the early neoplasias 
in this study (chromosome lq and PIK3CA), a shared 
program of transcriptional modifications appears to 
characterize early neoplasias as a whole. Given the as- 
sociation of early neoplasias with ER + breast cancers, 
we wished to further characterize two important tran- 
scription factors, FOXA1 and GATA3, thought to act 
upstream of ER in the estrogen response pathway. Both 
genes showed elevated transcript levels in both the early 
neoplasia and cancer samples. In fact, FOXA1 and GATA3 
showed expression levels up-regulated at least 50% in 19 
(86%) and 16 (73%) of 22 patients, respectively, when indi- 
vidual matched normal and neoplasia levels were com- 
pared (Table S7 in Additional file 1). 

To examine the expression of these potential master 
regulators in an independent collection of early neopla- 
sia cases, we performed IHC on a large number of sam- 
ples from 104 patients. Six tissue microarrays possessing 
various combinations of normal breast, early neoplasia, 
DCIS, and IDC within 534 tissue cores were constructed 



for this study. These tissue microarrays were stained for 
ER, FOXA1, and GATA3, as well as with hematoxylin 
and eosin (Figure 3A-C) to confirm the diagnoses within 
each tissue core. Cores were scored for protein expres- 
sion by estimating the fraction of cells staining positive 
for each tissue core, creating a dataset with multiple sam- 
ples per diagnosis for several of the patients (Figure 4A; 
Table S10 in Additional file 1). Multiple samples from the 
same patient were sometimes heterogeneous in staining 
(patient 45; Figure 4A) and sometimes remarkably similar 
with a given diagnosis (patient 62; Figure 4A). 

Strikingly, and as a validation for the ability of 3SEQ 
to identify expression differences of luminal cells from 
samples derived from epithelial and stromal cell mix- 
tures, only luminal cells were observed to stain positive 
using the antibodies against FOXA1 and GATA3, as well 
as ER (Figure 3D-L). Additionally there was a significant 
trend for early neoplasias to have a much higher fraction 
of cells staining positive for FOXA1 and GATA3 than 
normal, with positive cell fractions similar to those ob- 
served in cancer (Figure 4B,C; Table S10 in Additional 
file 1). This confirms that FOXA1 and GATA3 expres- 
sion is commonly and recurrently elevated in the ma- 
jority of early neoplasias (40/59 cases strong and 15/59 
cases intermediate staining for FOXA1; 31/57 cases 
strong and 21/57 cases intermediate staining for GATA3). 
While the ER gene (ESR1) did not show significantly al- 
tered transcript levels within either the early neoplasias or 
breast cancers in our sequencing dataset, IHC staining for 
ER showed elevated protein levels in both the early neo- 
plasias (57/61 cases with strong or intermediate staining) 
and breast cancers (79/99 with strong or intermediate 
staining; Figure 4C; Table S10 in Additional file 1). When 
these results were compared with IHC stains of FOXA1 
and GATA3 on the same cases, there was significant asso- 
ciation of positive FOXA1 and GATA3 cases with cases 
that were also positive for ER (Figure SI in Additional 
file 2), suggesting an up-regulation of the ER pathway, 
possibly driven by FOXA1 and GATA3, may be an im- 
portant early event in the majority of early neoplasias. 

Long non-coding RNAs modified in early neoplasias 

IncRNAs are gaining attention as potential regulators in 
cancer development. When 1,376 of the previously de- 
fined cancer-expressed IncRNAs [11] were analyzed in 
this study, we observed similar fractions of IncRNA tran- 
scripts experiencing modified expression in early neo- 
plasias relative to normal (4.4% IncRNAs up-regulated; 
0.7% IncRNAs down-regulated; compared with 4.1% 
RefSeq genes up-regulated and 1.4% RefSeq genes down- 
regulated; Tables 1 and 2; Table Sll in Additional file 1), 
indicating no global enrichment for IncRNA modifica- 
tions compared with RefSeq genes in this early stage of 
cancer development. Interestingly, fewer IncRNAs showed 
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Figure 3 ER, GATA3, F0XA1, and transcript 13741 are up-regulated in early neoplasia and cancer. Images of IHC for ER, GATA3, and 
F0XA1 for a matched set of normal breast, early neoplasia, and invasive ductal carcinoma (donor ID 22666) at 400x. (A) Hematoxylin and eosin 
(H&E) staining of normal tissue; (B) H&E of early neoplasia; (C) H&E of invasive ductal carcinoma; (D) ER in normal tissue; (E) ER in early neoplasia; 
(F) ER in invasive ductal carcinoma; (G) GATA3 in normal tissue; (H) GATA3 in early neoplasia; (I) GATA3 in invasive ductal carcinoma; (J) F0XA1 
in normal tissue; (K) F0XA1 in early neoplasia; (L) F0XA1 in invasive ductal carcinoma (M-0) RNA in situ hybridization for transcript 13741 in 
normal tissue (M), early neoplasia (N), and invasive ductal carcinoma (O). 



modifications between early neoplasia and cancer (2.0% 
IncRNAs up-regulated; 4.8% IncRNAs down-regulated; 
compared with 9.4% RefSeq genes up-regulated and 9.7% 
RefSeq genes down-regulated; P < 1 x 10" 10 ; hypergeome- 
tric test; Tables 1 and 2; Table Sll in Additional file 1), 
indicating that fewer IncRNAs may be involved in the 
stages of cancer development following the early neoplasia 
stage. Of the numerous IncRNAs and novel transcripts 
up-regulated at the early neoplasia stage, the previously 
identified novel transcript (transcript 13741) whose tran- 
scription was specific to breast tissue and highly correlated 
with ER + breast cancers [11] was the highest ranking dif- 
ferentially up-regulated transcript in early neoplasias re- 
lative to normal within our sequencing dataset (paired 
SAMseq; Table Sll in Additional file 1; see Materials 
and methods). When visualized using RNA in situ hy- 
bridization, this transcript showed elevated expression 
in early neoplasia and cancer compared with normal 
(Figure 3M-0), with staining patterns correlating well 
with the elevated staining of ER, FOXA1, and GATA3 
in the majority of neoplasia cases (data not shown). This 
suggests that transcript 13741 may have a role in the ER/ 
FOXA1/GATA3 pathway being activated in early neopla- 
sias and maintaining elevated levels within cancer. 

Stromal microenvironment varies in early neoplasias 

The tumor microenvironment is known to play an im- 
portant role in tumorigenesis. The presence of stromal 
cells within the tissue cores assayed by 3SEQ in this 



study provides the unique opportunity to examine dif- 
ferences of gene expression resulting from the tumor 
microenvironment of our samples. We identified the 
immunoglobulin kappa chain (IGKC), a B-cell immune 
gene gaining attention as a stromal biomarker predictive 
of prognosis in breast cancer and other carcinomas [20], 
as highly and differentially expressed within the sequen- 
cing results from our samples, decreasing in early neo- 
plasias relative to normal tissue (Table Sll in Additional 
file 1). When stained using RNA in situ hybridization, this 
transcript was absent from luminal cells and specifically 
stained immune cells situated in the tissue stroma (Figure 
S2 in Additional file 2). Notably, early neoplasias showed 
much less staining and were associated with fewer im- 
mune cells (Figure S2 in Additional file 2 and data not 
shown). This example demonstrates that non-luminal cell 
transcription contributes to and can be detected within 
our 3SEQ dataset, and suggests that immune cell associ- 
ation with early neoplasias, or lack therefore, is a recurrent 
feature at this early stage whose significance will require 
further study. 

Fibroblastic change is another stromal feature observed 
within breast cancer, and the stromal genes comprising 
the DTF gene expression signature can stratify breast can- 
cers prognostically [21]. Here we applied the core DTF 
gene set to identify the DTF-positive (fibroblast response) 
and DTF-negative (absence of fibroblastic response) breast 
cancers in our analysis (Figure 5A) and compared these 
patterns of expression with those from the early neoplasias 
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Figure 4 ER, FOXA1, and GATA3 are elevated in most early neoplasias and cancer. (A) Visualization of F0XA1 IHC scores for each patient 
and diagnosis. Each scored sample (tissue core) is a colored box. When multiple samples were scored for a given patient, the boxes are stacked. 
Patients are roughly sorted by early neoplasia scores. Patients are ordered the same for normal, early neoplasia, and cancer samples, so columns 
are aligned for patient-matched comparisons between diagnoses. Median IHC scores were used to obtain a single score for each diagnosis per 
patient. (B-D) These patient scores are plotted for F0XA1 (B), GATA3 (C), and ER (D). 



(Figure 5B) and normal tissue (Figure S3 in Additional 
file 2). Early neoplasias clustered using the core DTF 
gene set fell into three different groups: those showing co- 
ordinated co-expression of the DTF genes (DTF-positive; 
n = 12), those with an inverse DTF expression pattern as 
in DTF-negative cancer (n = 3), and those cases that were 
not clearly positive or negative (n = 10; Figure 5B). While 

Table 2 IncRNAs differentially expressed 



IncRNAs IncRNAs Total IncRNAs 

up-regulated down-regulated differentially 

expressed 



Normal versus early 


61 (4.4%) 


9 (0.7%) 


70 (5.1%) 


neoplasia 








Normal versus 


118 (8.6%) 


106 (7.7%) 


224 (16.3%) 


cancer 








Early neoplasia 


27 (2.0%) 


66 (4.8%) 


93 (6.8%) 


versus cancer 









Paired SAMseq analysis (n = 1,376), FDR <5%. Sixteen samples per class. 
Numbers and percentages reported are IncRNAs up- or down-regulated in 
early neoplasia relative to normal, and cancer relative to either normal or early 
neoplasia, respectively. 



there is a trend for patients with DTF-positive cancers 
to also show DTF-positive early neoplasias and nor- 
mal samples, and vice versa, the correlation is not ab- 
solute (Figure 3 in Additional file 2). These observations 
suggest that prognostic features of the tumor microenvir- 
onment typically observed in later stage breast cancers are 
also present within some early neoplasia and normal sam- 
ples, and its significance in the early stages of breast can- 
cer development will need further examination. 

Discussion 

The early stages of breast cancer development are poorly 
understood. While a number of environmental factors 
leading to breast cancer have been identified [3], there is 
a significant gap in our understanding of how these pro- 
cancer environmental factors function and how their in- 
fluences manifest in breast cancer development. Existing 
genomic studies have largely focused on invasive carcin- 
omas and metastasis, which have undergone significant 
genomic changes that likely include early ones that func- 
tioned in initiating oncogenesis, later ones that may confer 
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cancer early neoplasia 



Figure 5 Early neoplasias show altered DTF core gene 
expression. (A,B) Heatmaps showing cancer (A) and early neoplasia 
(B) samples hierarchically clustered using the DTF core gene 
signature [20]. Samples were classified as DTF + or DFT- based on 
the coordinated up-regulation (red) or down-regulation (green), 
respectively, of approximately 40 genes. 



invasive and metastatic phenotypes, as well as many 
genomic changes of no functional consequence. While 
we know which mutations, aneuploidies, and expres- 
sion changes occur in carcinoma, we have little insight as 
to the dynamics of the genomic changes. We do not know 
when these changes occur, in what order they occur, and 
how they influence the risk of becoming cancer. 

Morphologic studies have indicated that early neopla- 
sias, such as flat epithelial atypia and ADH, and possibly 
less advanced early neoplasias represent direct precur- 
sors to DCIS and invasive carcinoma. The molecular 
profiles of early neoplasias are likely to be enriched with 
genetic events that initiate oncogenesis compared with 
molecular profiles of more advanced neoplasias like 
DCIS and IDC. No large-scale study of genomic altera- 
tions has been performed for neoplasia progression, and 
prior research on pre-invasive breast neoplasias has largely 
focused on DNA copy number changes [22-25], without 
regard for matched progression to invasive carcinoma. 
Our recent full-genome sequencing study of 31 samples 
from 6 patients, including 14 early neoplasias, provided 
the most complete picture to date of mutations and aneu- 
ploidies present within early neoplasias, and definitively 
established a genetic relationship between early neoplasias 



and the adjacent invasive cancer [7]. Importantly, these 
cancer-associated early neoplasias have already acquired 
aneuploidies, common within breast cancer, and hundreds 
of mutations, suggesting that critical oncogenic events are 
occurring at this early stage. Thus, some of the early neo- 
plasias studied in this manuscript likely represent epi- 
thelium primed for critical oncogenic steps and others 
represent direct precursors. It is clear then that we are 
dealing with a mixture of cases that are related to the IDC 
(and are associated with progression) and cases that are 
not related. While this uncertainty limits the precision of 
this study, this feature also makes this neoplastic class 
interesting. 

Examination of gene expression within early neoplasias 
is difficult, as these small lesions can only be identified 
under the microscope. As a result, techniques for gene 
expression profiling that require fresh frozen material 
cannot be used for most instances of early neoplasia. 
Previously, we demonstrated the feasibility and utility of 
using archival RNA from FFPE material to obtain quan- 
titative global gene expression data using 3SEQ [10,11]. 
By measuring gene levels using only the 3 ' ends of tran- 
scripts, we were able to quantitatively profile fragmented 
RNA from archival specimens. The samples surveyed 
within these previous studies, however, represented well- 
established carcinomas and sarcomas for which large 
amounts of input material were available for profiling. In 
this study we demonstrate for the first time effective 
3SEQ profiling of the much smaller early neoplasia sam- 
ples, for which reduced amounts of input total RNA are 
available. By applying 3SEQ to early neoplasias as well as 
patient-matched normal and cancer samples, we provide 
for the first time a detailed look into the global gene ex- 
pression of early neoplasias, including how they differ 
from matched normal and cancer samples, and we shed 
light on the expression changes that characterize this 
early stage of luminal cell transformation on the pro- 
gression to invasive cancer. 

Our study showed that despite the genetic heterogen- 
eity in the early neoplasias, a strong pattern of shared 
transcriptional modifications can be detected across early 
neoplasias as a whole. Expression differences can distin- 
guish early neoplasias from both normal breast tissue and 
cancer, and hundreds of genes and tens of IncRNAs 
characterize this early stage of progression. Early neopla- 
sias with and without adjacent cancer were indistinguish- 
able in our analyses. Enrichment analysis using the genes 
modified during the progression from normal to early 
neoplasia suggests involvement of a number of biological 
processes (Table S12 in Additional file 1; see Materials 
and methods). 

Numerous known breast cancer genes show modified 
transcription within the early neoplasias. Elevated levels 
of the ERBB2 (HER2) transcript were unexpected in 
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these samples because only three of the associated IDCs 
had HER2 amplification. While HER2 amplification has 
been found to occur at the ADH stage, the morphologic 
stage that may represent the stage just after the early 
neoplasias profiled here [26], amplification is not likely 
to explain the increased transcript levels for the majority 
of the early neoplasias studied here. Instead the elevated 
ERBB2 transcript levels in early neoplasias suggest that 
HER2 may be functioning in a non-amplified setting. 
This is reminiscent of the recent work demonstrating 
the importance of elevated HER2 levels in non-amplified 
HER2 breast cancer stem cells [27] and suggests that 
HER2 may be playing a role in the early stages of cancer 
development, setting the stage for future oncogenic 
events. Our findings that ER, FOXA1, and GATA3 are 
all expressed at levels comparable to their matched IDC 
suggest that the oncogenic nature of ER pathway acti- 
vation is established in the earliest stages of cancer de- 
velopment. Recent literature suggests that FOXA1 may 
be a 'pioneer' factor necessary for establishing chromatin 
accessibility of ER to its target genes [28]. Therefore, 
FOXA1 may be one of the important early events in ER 
pathway activation, setting the stage for subsequent can- 
cer development. The role for the transcription factor 
GATA3 at this early stage is less clear. Work has sug- 
gested that GATA3 may act as a differentiation factor 
within breast cells that, when lost, typically through mu- 
tation, allows cancer progression [29]. GATA3 levels 
were elevated in both the early neoplasias and cancers 
in this study and were highly correlated with ER and 
FOXA1 expression, but mutation status in these samples 
is unknown. ER pathway activation within the early neo- 
plasias suggests that this event alone is not sufficient for 
a cancer phenotype but may be important for priming 
cells for later events necessary for the development of 
cancer. 

The effects of early mutations on gene expression chan- 
ges were not readily discernible. Activating mutations of 
PIK3CA are present in 36% of all breast cancers, signifi- 
cantly enriched within luminal A breast cancers (45%) [1], 
and function by activating the PI3K/AKT pathway to alter 
a number of cellular processes, including cell proliferation, 
differentiation, and survival [30]. PIK3CA was previously 
detected as mutated in 13 of the 20 early neoplasias pro- 
filed within this study (Table SI in Additional file 1) [8]. 
Strikingly, despite this well-known, common mutation 
target, we were not able to detect associated expression 
changes. Transcript levels of PIK3CA were not signifi- 
cantly different between mutant and wild-type early neo- 
plasias, and these two groups did not show any of the 
previously described transcriptional changes found in IDC 
[19]. Gain of chromosome lq is another common alter- 
ation observed in breast cancers. Our prior sequencing 
analysis identified this as a common aneuploidy within the 



early neoplasias examined [7]. Here we used FISH to 
identify the chromosome lq gain in one of the cases 
previously sequenced as well as one other early neo- 
plasia sample out of the 10 early neoplasia cases evaluated. 
While we were unable to classify these two groups based 
on gene expression differences, we did identify a list of 98 
genes from across the genome all down- regulated in the 
chromosome lq samples versus wild type. Aneuploidy 
work in yeast has shown that copy number alterations of 
chromosome arms can effect gene expression globally and 
is not limited to genes on the effected chromosome 
[31,32]. Our preliminary results here also indicate that the 
gain of chromosome lq may have global effects on gene 
expression, and given its prevalence in early neoplasias as- 
sociated with cancer, it may be important for the earliest 
stages of cancer development. 

Previously we profiled the expression of IncRNAs within 
a variety of cancer types to determine global patterns 
of IncRNA expression in cancer [11]. IncRNAs are 
gaining attention for their regulatory roles within can- 
cer. The prototype IncRNA is HOX antisense intergenic 
RNA (HOTAIR), a transcript shown to regulate HOX gene 
transcription and whose elevated levels in breast cancer 
are associated with patient outcome [33]. The search for 
additional IncRNA regulators with roles in cancer has led 
numerous studies to catalog and profile IncRNAs in a 
number of tissues and settings. Here we examined expres- 
sion of the cancer-expressed IncRNAs within early neopla- 
sias to identify global patterns of IncRNA usage in this 
early stage of cancer development. We identified numer- 
ous IncRNAs expressed within early neoplasias and differ- 
entially expressed when compared to normal tissue and 
cancer. Interestingly, the depletion of differentially ex- 
pressed IncRNAs between early neoplasia and cancer sug- 
gests that IncRNAs may play more of a role early in the 
progression process, at the early neoplasia stage, and less 
of a role during later stages of cancer development. We 
had previously identified nuclear transcript 13741, along 
with coordinately expressed transcripts from the same 
100 kb region on chromosome 10, to be highly expressed 
in breast cancer and correlated with ER + breast cancer 
[11], and others have found it associated with breast can- 
cer recurrence [34]. Here we observed that this transcript 
is significantly elevated in early neoplasias compared to 
normal tissue using both 3SEQ and RNA in situ hy- 
bridization, and it shows correlated expression with the 
ER pathway members ER, FOXA1, and GATA3. 

This study has identified a number of transcriptional 
features that characterize early neoplasias and provides 
insight into the molecular mechanisms at work during 
this early stage of neoplastic transformation. In our stu- 
dy, many of the early neoplasia samples were derived 
from specimens that also had the matched cancer sam- 
ple. While we cannot determine relatedness based on 
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gene expression profiling, we do have some insight into 
the relatedness of this class of sample from our prior 
whole genome sequencing of early neoplasia study [7]. 
By sequencing a series of patient- matched specimens 
similar to those in the current manuscript and compar- 
ing single nucleotide variations and validating with copy 
number changes, we have created evolutionary trees of 
the different breast neoplasia samples from normal breast 
to early neoplasia to DCIS to IDC. In that study, we found 
that 4 of 14 early neoplasias were evolutionarily related to 
the invasive cancer (two of these cases are represented in 
the current manuscripts cohort). Thus, some of the early 
neoplasias studied in this manuscript likely represent epi- 
thelium primed for critical oncogenic steps and others 
represent direct precursors. Given the morphologic di- 
versity of early neoplasias, their differences in association 
with both concurrent and future cancer, and the recently 
described genetic differences in lq gain and PIK3CA mu- 
tation frequencies, we were struck by the similarity of 
their transcriptional profiles. The differences separat- 
ing early neoplasias from both normal tissue and cancer 
were larger than any differences examined between sub- 
groups of early neoplasias, even when corrected for smal- 
ler group sample sizes. Unsupervised clustering identified 
no robust subgroups of early neoplasias. Instead, the early 
neoplasias in our study show a common set of expression 
changes that appear to characterize the early neoplasias as 
a whole. Elevated levels of the known breast cancer genes 
along with a host of others are selectively modified at very 
early stages in the neoplastic transformation process in 
most cases. The events that drive further progression to 
cancer in a fraction of cases remain to be determined, but 
activation of the ER pathway early on may prime cells for 
additional events necessary for cancer progression. Im- 
proved characterization of these early molecular events 
with cancer progression with future studies will hopefully 
pave the way for detection and prevention tools to im- 
prove patient care. 

Conclusions 

This study represents the first global examination of 
gene expression within early neoplasias, and identifies 
several features that appear to characterize early neopla- 
sias as a whole and represent insights into this very early 
stage of cancer development. 

Materials and methods 

Samples 

Tumor and normal samples were collected using HIPAA 
compliant Stanford University Medical Center institu- 
tional review board approved written informed consent. 
Some of the tissues already existed in tissue banks and 
fall under exemption 4. The FFPE tissue blocks were ar- 
chived with the Stanford University and Oregon Health 



and Sciences University Departments of Pathology. His- 
topathological sections of breast resection specimens 
were screened for the presence of early neoplasia (spe- 
cifically columnar cell change with and without atypia) 
by RBW. For tissue blocks possessing early neoplasia, 
2-mm diameter core punches were used to collect dense 
areas of neoplasia and adjacent normal breast epithelial 
content separately. When DCIS and/or IDC were pre- 
sent, these were also cored. Three to ten tissue cores 
were pooled for each diagnosis per patient, and tissue 
cores were re-embedded in paraffin and re-evaluated 
histologically for lesion purity by sectioning length-wise 
along the core. Longitudinal examination of the cells 
across the depth of the tissue cores enabled us to ob- 
serve contamination of our early neoplasia samples with 
normal breast epithelium or cells of any other pathology 
(ADH, DCIS, or IDC). Only samples that possessed >90% 
of luminal cells with the appropriate diagnosis within the 
epithelial compartment were included among our sam- 
ples. Adjacent normal cores contained no identifiable pa- 
thologies, and cancer cores possessed less than 5% early 
neoplasia or normal epithelial cells. In all cases stromal 
cells were present in the core samples. DNA and total 
RNA were purified from paraffin slices of the embedded 
tissue cores using the RecoverAll Total Nucleic Acid 
Isolation Kit (Ambion/Life Technologies, Austin, TX, 
USA, catalog #1975). Briefly, deparaffination with a xylene 
incubation was followed by an ethanol wash and protease 
digestion. DNA was used for analysis of hotspot mutations 
and published separately [8]. Here, total RNA was used 
for 3SEQ library preparation. 

3SEQ library preparation and sequencing 

We optimized the 3SEQ library preparation method to 
work with reduced amounts of input total RNA to allow 
for profiling of the much smaller early neoplasia sam- 
ples. 3SEQ libraries were prepared from as little as 5 ug 
total RNA. RNA was enriched for 3' ends by using the 
Oligotex mRNA mini Kit (QIAgen, Valencia, CA, USA, 
catalog #70022). The 3' poly(A)-enriched RNA pools 
were made into directional 3SEQ Illumina sequencing 
libraries, as previously described [10,11] and summa- 
rized here. RNA that was not sufficiently fragmented was 
heat-sheared to a size of approximately 100 to 200 bases 
by incubation with First Strand Buffer (Invitrogen/Life 
Technologies, Grand Island, NY, USA, catalog #18080- 
044) and oligo-dT-P7 primer (5'-CAA GCA GAA GAC 
GGC ATA CGA GCT CTT CCG ATC TTT TTT TTT 
TTT TTT TTT TTT TTT TVN-3 ) at 85°C for 3 to 5 mi- 
nutes, depending on the extent of fragmentation required. 
The mixture was cooled to 50°C and first strand cDNA 
synthesis was performed in a 20 ul reaction using Su- 
perscript III Reverse Transcriptase (Invitrogen/Life Tech- 
nologies, catalog #18080-044) for 1 hour at 50°C. Second 
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strand cDNA was synthesized using Second Strand Buffer 
(Invitrogen/Life Technologies, catalog #10812-014), DNA 
ligase (Invitrogen/Life Technologies, catalog #18052-019), 
DNA polymerase I (New England Biolabs, Ipswich, MA, 
USA, catalog #M0209L), and RNaseH (Epicentre Biotech- 
nologies/Illumina, Madison, WI, USA, catalog #R0601K), 
and purified using the QIAquick PCR Purification Kit 
(QIAgen, catalog #28104). 3' End repair and modifi- 
cation used dATP (Invitrogen/Life Technologies, cata- 
log #10216018), Klenow exo (New England Biolabs, 
catalog #M0212L) and Klenow Buffer (NEB Buffer 2, 
New England Biolabs, catalog #B7002S), and was puri- 
fied using the QIAgen MinElute PCR Purification Kit 
(QIAgen, catalog #28204). cDNA was then ligated to the 
double-stranded P5 linker (Adapter Oligo Mix; Illumina, 
San Diego, CA, USA) for 15 minutes at 22°C. SPRI bead 
purification (Agencourt Biosciences/Beckman Coulter, 
Beverly, MA, catalog #A63880) was performed both prior 
to and following PCR amplification of the linker-ligated 
cDNA. PCR reactions with 5 -AAT GAT ACG GCG ACC 
ACC GAG ATC TAC ACT CTT TCC CTA CAC GAC 
GCT CTT CCG ATC T-3' and 5-CAA GCA GAA GAC 
GGC ATA CGA GCT CTT CCG ATC-3' primers and 
the Phusion PCR Master Mix (New England Biolabs, 
catalog #M0531) used a 15 cycle program (98°C for 
30 seconds; 15 cycles of 98°C for 10 seconds, 65°C 
for 30 seconds, 72°C for 30 seconds; 72°C for 5 minutes). 
The libraries were then size selected for 200 to 300 bp 
fragments by agarose gel fractionation (3% Nusieve GTG; 
Lonza/ Fisher Scientific, Pittsburgh, PA, USA) and purified 
using the QIAquick Gel Extraction Kit (QIAgen; cata- 
log #28704). The libraries contain P7 sequence at the 
3' end downstream of poly(A) and P5 at the 5' end. 
3SEQ libraries were sequenced with Illumina GAIIx 
machines to obtain 36-base directional, single-end se- 
quence reads. The first 25 bases of the sequence reads 
were mapped to hgl8 with a two-mismatch allowance 
using ELAND (Illumina) and further filtered to remove 
mapping artifacts caused by ambiguous mapping, as previ- 
ously described [11]. A minimum of 1.4 million uniquely 
mapping reads were obtained for each library (Table S2 in 
Additional file 1). Reads within exons of RefSeq genes 
(n = 22,775 genes; downloaded July 2011 from the 
University of California Santa Cruz Genome browser) 
were tallied to obtain the total number of reads per 
gene for each sample (see Additional file 3 for scripts). See 
Gene Expression Omnibus accession [GEO:GSE47462] 
for raw fastq files and the file of raw gene counts. 

RNAseq library preparation and sequencing 

Samples were obtained using methods identical to the 
original cohort. However, for this validation cohort we 
used full transcriptome RNAseq, not 3SEQ, with rRNA 
depleted non-polyA selected transcriptome library. Total 



RNA was extracted and purified from FFPE materials 
with a commercially available kit (Recover All Total Nu- 
cleic Acid Isolation Kit; Ambion, catalog #AM1975). 

Both cytoplasmic (nuclear-encoded) and mitochon- 
drial rRNA were depleted using the Ribo-Zero Magnetic 
Human Gold Kit (Epicentre, Madison, WI, an Illumina 
Company; catalog #MRZG12324). Input total RNA (1 to 
4 ug) was incubated with removal solution containing 
specific probes according to instructions and incubated 
at 68°C for 10 minutes. Cytoplasmic and rRNA bound to 
probes were removed by magnetic bead pull-down. The 
final ribosomal-depleted RNA was recovered after so- 
dium acetate /glycogen addition and ethanol precipita- 
tion overnight. Samples were centrifuged at 10,000 g for 
30 minutes, washed once per instruction, and resuspen- 
ded with 17 ul of EPF (Elute, Prime, Fragment Mix, from 
the kit of TruSeq RNA sample Preparation v2, Illumina; 
catalog #RS-122-2001). RNA was fragmented at 94°C for 
15 seconds. The remaining library construction was per- 
formed as above. The libraries were quantified with Qubit 
2.0 Fluoro meter (Life Technologies, Foster City, CA, 
USA) and validated with BioAnalyzer 2100 (Agilent). 

Libraries were sequenced with Illumina HiSeq machines 
to obtain 101-base directional, paired-end sequence reads. 
These 101 bp paired-end reads were uniquely mapped 
to hgl9 using Bowtie2/TopHat2 with the default set- 
tings [35]. Duplicates were removed using Picard. Frag- 
ments per kilobase of exon per million fragments mapped 
(FPKM) for gene expression was calculated using Cuf- 
flinks v2.1.1 [36] (see Additional file 4). 

Classification analysis 

Genes with an average of less than one read per sample 
(<72 reads across all samples) were removed from the 
dataset, resulting in a dataset of 15,748 (>72 reads across 
the samples) genes that was then normalized and trans- 
formed using package SAM 2.0 [13]. Briefly, the counts 
from the data were transformed using Anscombe trans- 
formation [37] to stabilize variance, and then normalized 
using sequencing depths estimated by PoissonSeq [38]. 
The resulting normalized data are roughly Gaussian- 
distributed and the values are comparable across sam- 
ples [12], resembling microarray data. This dataset was 
used to perform a three-class (unpaired: normal, early 
neoplasia, cancer) PAM analysis [12] using 10-fold cross- 
validation on the 72 samples. Briefly, PAM shrinks the 
centroids of gene expression of each class to their overall 
centroid and classifies the samples by the nearest centroid 
rule. The classifier only uses a sparse set of genes whose 
class-specific centroids are different from the overall cen- 
troid after shrinkage, and the size of the sparse set is 
chosen by minimizing the cross-validation error. DCIS 
and invasive cancers were combined into the 'cancer' class 
for this study. A two class (unpaired) PAM analysis was 
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also performed using 10-fold cross-validation on the 49 
normal and early neoplasia samples (see Additional file 3 
for scripts). Validation of using the TCGA RNAseq data 
from breast invasive carcinoma samples and the matched 
normal samples utilized the level 3 data from batch 93. 
This dataset contains expression levels measured by RNA- 
SeqV2 of 20,502 genes in 45 breast invasive carcinoma 
samples and 16 matched normal samples. We filtered 
2,212 genes as they have too few numbers of reads across 
samples. We then used SAMseq to detect significant 
genes and estimate the adjusted P- values (FDR). 

Differentially expressed genes 

To leverage the patient-matched samples in this study, 
we examined data from the subset of patients posses- 
sing samples of all three diagnostic classes: normal, early 
neoplasia, and cancer (DCIS or invasive). Right and left 
breast samples obtained from one patient were treated 
as independent cases. For the three patients who had 
more than one sample for a diagnostic class (two cancer 
samples, for example), we used the sample with the lar- 
ger sequencing depth. In this way, we developed a trun- 
cated dataset with a single instance of matched normal, 
early neoplasia, and cancer for each of 16 patients. This 
dataset was filtered to remove genes with fewer than five 
reads in each diagnostic class, and normalized using 
SAM 2.0 as described above, to yield a dataset of 13,643 
genes for further analysis. Paired differential expression 
analysis was performed using SAMseq [13]. Briefly, count 
data are stochastically normalized sequencing depths 
using 'Poisson resampling' [13], and nonparametric statis- 
tics are calculated to measure the expression differences 
among classes. Next, permutations are performed to ob- 
tain the null distribution of these nonparametric statistics. 
It has been shown that compared with parametric me- 
thods (often based on negative-binomial distribution), 
SAMseq is more powerful and robust to noise when 
the sample size is moderate or large [39]. Differen- 
tially expressed genes at FDR <5% were identified and 
are reported in Tables S5 to S8 in Additional file 1 
(see Additional file 3 for scripts). 

Fluorescence in situ hybridization 

FISH was performed to examine chromosome lq gain 
using 4-|im FFPE sections from 10 of the early neoplasia 
samples (Table S8 in Additional file 1). Bacterial artificial 
chromosome (BAC) clones RP11-1044H13 (lq32) and 
RP11-1120 M18 (3q25) were obtained from the BACPAC 
Resources Centre (Children's Hospital Oakland Research 
Institute, Oakland, CA, USA), while clone CTD-2344 
F21 (2q37) was from Invitrogen/Life Technologies. Probe 
RP11-1044H13 was labeled with Cy3 dUTP (Amersham/ 
GE Healthcare, Pittsburgh, PA, USA) and control probes 
RP11-1120 M18 and CTD-2344 F21 were labeled with 



AlexaFluor 647-aha-dUTP (Invitrogen/Life Technologies, 
catalog #A32763) and Green dUTP (Abbot Molecular, 
Des Plaines, IL, USA; catalog #02 N32-050), respectively, 
using the Nick Translation Kit (Abbot Molecular). Slides 
were deparaffinized in xylene twice for 10 minutes, dehy- 
drated twice with 100% ethanol, air dried for 10 minutes, 
and then pretreated in 10 mM citric acid pH6 at 80°C for 
45 minutes. Slides were digested for 40 minutes in pepsin 
(75,000 units; Sigma- Aldrich, St Louis, MO, USA; cata- 
log #P6887) at 37°C. Fluorescent labeled probes and slides 
were co-denatured at 75°C for 7 minutes and hybridized 
at 37°C for 16 to 18 hours in a humidified chamber. Post- 
hybridization washes were performed using 2xSSC/0.3% 
NP-40 at 72°C for 5 minutes. Slides were dehydrated 
and air dried in the dark, and counterstained with DAPI 
(Invitrogen/Life Technologies, catalog #P36935). Imaging 
and analysis were performed using Ariol 3.4v software 
(Genetix/Leica Microsystems, Wetzlar, Germany). Fluo- 
rescence was scored visually using filters for 550 nm 
(green, Cy3), 647 nm (red, AlexaFluor 647), and 488 nm 
(yellow, Abbot Green). Total signals for each color within 
a given slide region were counted, regardless of nucleus. 
Signals from roughly 40 to 300 cells per sample were 
counted. Total green counts (lq32) were compared with 
red (3q25 control) and yellow (2q37 control) counts. 
Those samples with ratios 1.5 or greater were considered 
amplified for lq (Table S8 in Additional file 1). 

Immunohistochemistry 

Primary antibodies were directed toward FOXA1 (NBP1- 
49791, rabbit polyclonal, Novus Biologicals, Littleton, 
CO, USA), GATA3 (HG3-35, mouse monoclonal, Santa 
Cruz Biotechnology, Dallas, TX, USA), and ER (SP1, 
rabbit monoclonal, Abeam, Cambridge, MA, USA). Tissue 
microarrays (Stanford TA375, TA386, TA387, TA390, 
TA392, TA393) were constructed using a technique previ- 
ously described [40] with a tissue arrayer (Beecher Instru- 
ments, Silver Spring, MD, USA). We took 615 cores (0.6 
or 2.0 mm) from paraffin-embedded breast samples 
from 105 patients with early neoplasia archived with the 
Stanford University Medical Center between 2000 and 
2012. Multiple cores were taken for each patient. Cores 
often possessed more than one diagnosis, so total instances 
scored across the 615 cores represent: 105 normal, 138 
early neoplasia, 161 DCIS, and 84 IDC. ER and GATA3 
IHC was performed on 4 (im sections using the Ventana 
BenchMark XT automated immunostaining platform 
(Ventana Medical Systems/Roche, Tucson, AZ, USA). 
For FOXA1 IHC, sections of 4 \im were cut from the 
tissue array blocks, deparaffinized in xylene, hydrated 
in a graded series of alcohol, and prepared for staining 
using Target Retrieval Solution, Citrate pH6 (Dako/ 
Agilent, Carpinteria, CA, USA, catalog #S2369) to retrieve 
antigenic sites at 116°C for 3 minutes. Staining was then 
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performed using the EnVision + anti-rabbit system (Dako/ 
Agilent, catalog #K4011). Estimates of the fraction of 
cells staining positive within the nucleus of luminal 
cells in the epithelial compartment were made and scores 
were assigned as follows: FOXA1 '0', <50% cells; FOXA1 
T, 50 to 90% cells; FOXA1 '2', >90% cells. ER and GATA3 
'0', <20% cells; ER and GATA3 T, 20 to 90% cells; 
ER and GATA3 '2', >90% cells. When multiple scores 
were obtained for a given diagnosis within a patient, 
the median score was used as the representative score. In 
this way we obtained 58 independent instances of normal, 
61 early neoplasias, 91 cancer (or 88 DOS, 51 IDC when 
considered separately). Many of these represented 
complete patient-matched sets (Table S10 in Additional 
file 1). 



IncRNA analysis 

The peaks previously identified for 1,065 known IncRNAs 
and 1,071 novel transcripts were profiled by counting se- 
quencing reads falling within the previously defined 3SEQ 
peak coordinates [11]. A dataset of 48 samples comprising 
the 16 normal, early neoplasia, and cancer sets of patient- 
matched samples was filtered to remove transcripts with 
fewer than five reads in at least one diagnostic class, yield- 
ing 1,376 transcripts for further analysis. A series of paired 
SAMseq analyses was performed using the SAM 2.0 R 
package as described above [12]. Differentially expressed 
genes at FDR <5% were identified and are reported in 
Table S10 in Additional file 1 (see Additional file 3 for 
scripts). 

RNA in situ hybridization 

The RNA in situ hybridization probe for IGKC was de- 
signed against chr2: 88,937,790-88,938,290 (hgl8) using 
primer 5 -CTG TTG TGT GCC TGC TGA AT-3' and 
the T7 promoter-tagged primer 5'-CTA ATA CGA CTC 
ACT ATA GGG TTA AAG CCA AGG AGG AGG AG- 
3'. RNA in situ hybridizations for transcript 13741 
(probe described previously [11]) and IGKC were per- 
formed on TA375, as described previously [11]. 

DTF core gene analysis 

Normalized expression data for 61 of the 66 genes from 
the DTF core gene signature [20] were used to hierar- 
chically cluster the cancer, early neoplasia, and normal 
samples separately, using average linkage clustering of 
centered data with Cluster 3.0. Heatmaps were visual- 
ized using JavaTreeView. Samples were classified as 
DTF+, DFT-, or undetermined based on coordinated up- 
regulation of approximately 40 genes, as seen in Figure 5 
and Figure S3 in Additional file 2. 



Additional files 



Additional file 1: Table SI. contains a summary of the samples, 
including clinical and PIK3CA mutation status. Table S2. lists sequencing 
read counts and statistics for the 3SEQ libraries. Table S3, shows PAM 
analysis cross-validated probabilities for each sample. Table S4. shows 
classification genes identified by PAM analysis. Table S5. shows genes 
differentially expressed between normal and cancer, FDR <5%. Table S6. 
shows genes differentially expressed between early neoplasia and cancer, 
FDR <5%. Table S7. shows genes differentially expressed between nor- 
mal and early neoplasia, FDR <5%. The table includes information for 
which genes overlap genes differentially expressed in cancer, PAM 
classification genes, TCGA genes, Cancer Gene Census genes, and genes 
comprising the intrinsic breast cancer gene signature, PAM50 gene 
signature, and Genes-to-Systems Breast Cancer database. Normalized 
expression values for each of the 72 profiled samples are included. 
Table S8. shows FISH signal counts used in determining chromosome 
1q gain in early neoplasias. Table S9. shows genes differentially 
expressed between early neoplasias with chromosome 1q gain and wild- 
type early neoplasias, FDR <5%. Table S10. shows immunohistochemistry 
scores for ER, FOXA1, and GATA3 stains. Table S1 1 . shows differentially 
expressed IncRNAs and novel transcripts between normal tissue and early 
neoplasia, FDR <5%. Table SI 2. shows KEGG and PATHNER enrichment 
values for normal versus neoplasia genes. Table S13. shows significant 
genes between normal versus cancer for the RNAseq validation dataset 
based on r-test with a P-value cutoff of 0.05. Table S14. shows significant 
genes between early neoplasia versus cancer for the RNAseq validation 
dataset based on a r-test with a P-value cutoff of 0.05. Table S15. shows 
significant genes between normal versus early neoplasia for the RNAseq 
validation dataset based on a r-test with a P-value cutoff of 0.05. 

Additional file 2: Figure SI. plots FOXA1 and GATA3 IHC scores in 

relation to patient-matched ER scores. Figure S2. shows IGKC RNA in situ 
hybridization in normal tissue and early neoplasia. Figure S3, shows 
clustered heatmaps of DTF gene signatures for normal tissue, early 
neoplasias, and cancer, and includes gene and sample names. 

Additional file 3: Perl and R scripts used in analysis. See the 

documentation files for details. 

Additional file 4: Validation dataset for classification analysis. 
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